On the Convergence of Bound Optimization Algorithms
Abstract
Many practitioners who use EM and related algorithms complain that they are sometimes slow. When does this happen, and what can be done about it? In this paper, we study the general class of bound optimization algorithms – including EM, Iterative Scaling, Non-negative Matrix Factorization, and CCCP – and their relationship to direct optimization algorithms such as gradient-based methods for parameter learning. We derive a general relationship between the updates performed by bound optimization methods and those of gradient and second-order methods, and identify analytic conditions under which bound optimization algorithms exhibit quasi-Newton behavior, and under which they possess poor, first-order convergence. Based on this analysis, we consider several specific algorithms, interpret and analyze their convergence properties, and provide some recipes for preprocessing input to these algorithms to yield faster convergence behavior. We report empirical results supporting our analysis and showing that simple data preprocessing can result in dramatically improved performance of bound optimizers in practice.

1 Bound Optimization Algorithms

Many problems in machine learning and pattern recognition ultimately reduce to the optimization of a scalar-valued function L(Θ) of a free parameter vector Θ. For example, in supervised and unsupervised probabilistic modeling the objective function may be the (conditional) data likelihood or the posterior over parameters. In discriminative learning we may use a classification or regression score; in reinforcement learning, an average discounted reward. Optimization may also arise during inference; for example, we may want to reduce the cross entropy between two distributions or minimize a function such as the Bethe free energy.

Bound optimization (BO) algorithms take advantage of the fact that many objective functions arising in practice have a special structure. We can often exploit this structure to obtain a bound on the objective function and proceed by optimizing this bound. Ideally, we seek a bound that is valid everywhere in parameter space, easily optimized, and equal to the true objective function at one (or more) point(s). A general form of a bound maximizer which iteratively lower-bounds an objective function L(Θ) is given below:

General Bound Optimizer for maximizing L(Θ):
• Assume: ∃ G(Θ, Ψ) such that for any Θ′ and Ψ′:
  1. G(Θ′, Θ′) = L(Θ′) and L(Θ′) ≥ G(Θ′, Ψ′) ∀ Ψ′ ≠ Θ′
  2. arg max_Θ G(Θ, Ψ′) can be found easily for any Ψ′.
• Iterate: Θ^{t+1} = arg max_Θ G(Θ, Θ^t)
• Guarantee: L(Θ^{t+1}) = G(Θ^{t+1}, Θ^{t+1}) ≥ G(Θ^{t+1}, Θ^t) ≥ G(Θ^t, Θ^t) = L(Θ^t)

Bound optimizers do nothing more than coordinate ascent in the functional G(Θ, Ψ), alternating between maximizing G with respect to Ψ for fixed Θ and with respect to Θ for fixed Ψ. These algorithms enjoy a strong guarantee: they never worsen the objective function. Many popular iterative algorithms are bound optimizers, including the EM algorithm for maximum likelihood learning in latent variable models [2], iterative scaling (IS) algorithms for parameter estimation in maximum entropy models [1], non-negative matrix factorization (NMF) [3], and the recent CCCP algorithm for minimizing the Bethe free energy in approximate inference problems [12]. In this paper we explore two questions of theoretical and practical interest: when will bound optimization be fast or slow relative to other standard approaches, and what can be done to improve convergence rates of these algorithms when they are slow?
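As a concrete illustration of the scheme above, the following minimal sketch runs EM, a canonical bound optimizer, on a two-component unit-variance Gaussian mixture: the E-step constructs the bound G(Θ, Θ^t) through responsibilities, and the M-step maximizes it in closed form. The model, toy data, and all function names are illustrative assumptions, not taken from the paper; the assert simply checks the monotonicity guarantee stated in the box above.

```python
# A minimal sketch of a bound optimizer: EM for a two-component, unit-variance
# 1-D Gaussian mixture. The E-step defines the bound G(Theta, Theta_t) via
# responsibilities; the M-step maximizes it in closed form. All names and the
# toy data are illustrative assumptions, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

def log_likelihood(theta, x):
    """Objective L(Theta): log-likelihood of the two-component mixture."""
    pi, mu1, mu2 = theta
    p1 = pi * np.exp(-0.5 * (x - mu1) ** 2) / np.sqrt(2 * np.pi)
    p2 = (1 - pi) * np.exp(-0.5 * (x - mu2) ** 2) / np.sqrt(2 * np.pi)
    return np.sum(np.log(p1 + p2))

def em_step(theta, x):
    """One bound-optimization step: Theta_{t+1} = argmax_Theta G(Theta, Theta_t)."""
    pi, mu1, mu2 = theta
    p1 = pi * np.exp(-0.5 * (x - mu1) ** 2)
    p2 = (1 - pi) * np.exp(-0.5 * (x - mu2) ** 2)
    r = p1 / (p1 + p2)                       # E-step: responsibilities define G
    pi_new = r.mean()                        # M-step: closed-form maximizer of G
    mu1_new = np.sum(r * x) / np.sum(r)
    mu2_new = np.sum((1 - r) * x) / np.sum(1 - r)
    return np.array([pi_new, mu1_new, mu2_new])

theta = np.array([0.5, -1.0, 1.0])           # initial guess
for t in range(100):
    new_theta = em_step(theta, x)
    # The bound-optimizer guarantee: L never decreases (up to floating-point slack).
    assert log_likelihood(new_theta, x) >= log_likelihood(theta, x) - 1e-9
    theta = new_theta
print("estimated parameters:", theta)
```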
2 Convergence Behavior and Analysis

How large are the steps that bound optimization methods take? Any bound optimizer implicitly defines a mapping M : Θ → Θ from parameter space to itself, so that Θ^{t+1} = M(Θ^t). If the iterates Θ^t converge to a fixed point Θ*, then Θ* = M(Θ*). If M(Θ) is continuous and differentiable, we can Taylor expand it in the neighborhood of the fixed point Θ*:

Θ^{t+1} − Θ* ≈ M′(Θ*)(Θ^t − Θ*)    (1)

where M′(Θ*) = ∂M/∂Θ |_{Θ=Θ*}. Since M′(Θ*) is typically nonzero, a bound optimizer can essentially be seen as a linear iteration algorithm with a "convergence rate matrix" M′(Θ*). Intuitively, M′(Θ*) can be viewed as an operator that forms a contraction mapping around Θ*. In general, we would expect the Hessian ∂²L(Θ)/∂Θ∂Θᵀ |_{Θ=Θ*} to be negative semidefinite, or negative definite, and thus the eigenvalues of M′(Θ*) to all lie in [0, 1] or [0, 1) respectively [4]. Exceptions to the convergence of the bound optimizer to a local optimum of L(Θ) occur if M′(Θ*) has eigenvalues whose magnitudes exceed unity. Near a local optimum, this matrix is related to the curvature of the functional G(Θ, Ψ):

lim_{Θ^t → Θ*} M′(Θ^t) = − [ ∇²_{ΘΨ} G(Θ, Ψ) ] [ ∇²_{ΘΘ} G(Θ, Ψ) ]^{−1}    (2)

where we define the mixed partials and Hessian as ∇²_{ΘΨ} G(Θ, Ψ) ≡ ∂²G(Θ, Ψ)/∂Θ∂Ψᵀ |_{Θ=Θ*, Ψ=Θ*} and ∇²_{ΘΘ} G(Θ, Ψ) ≡ ∂²G(Θ, Ψ)/∂Θ∂Θᵀ |_{Θ=Θ*, Ψ=Θ*}.
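To make the convergence rate matrix concrete, the sketch below estimates M′(Θ*) by finite differences for a scalar EM map, using the classical genetic-linkage multinomial example of Dempster, Laird and Rubin (1977). That example, its data, and all names in the code are illustrative assumptions, not part of this paper; for a scalar parameter, M′(Θ*) is a single number whose magnitude is the linear convergence rate in Eq. (1).

```python
# A minimal sketch: numerically estimate the "convergence rate matrix" M'(Theta*)
# for a scalar EM map by finite differences. The model and data are the classical
# genetic-linkage multinomial example (Dempster, Laird & Rubin, 1977), used here
# purely as an illustrative assumption, not taken from this paper.
import numpy as np

y = np.array([125.0, 18.0, 20.0, 34.0])         # observed multinomial counts

def em_map(theta):
    """One EM update Theta_{t+1} = M(Theta_t) for the linkage model."""
    x = y[0] * (theta / 4) / (0.5 + theta / 4)   # E-step: expected split of cell 1
    return (x + y[3]) / (x + y[1] + y[2] + y[3]) # M-step: closed-form maximizer

# Run the bound optimizer to (near) convergence to locate the fixed point Theta*.
theta = 0.5
for _ in range(200):
    theta = em_map(theta)

# Central-difference estimate of M'(Theta*). Eigenvalues (here a single scalar)
# in [0, 1) indicate linear convergence; values close to 1 indicate slow progress.
eps = 1e-6
rate = (em_map(theta + eps) - em_map(theta - eps)) / (2 * eps)
print(f"fixed point Theta* ~ {theta:.4f}, convergence rate M'(Theta*) ~ {rate:.4f}")
```

Plugging the estimated rate back into Eq. (1) predicts how the error shrinks per iteration: each step multiplies the distance to Θ* by roughly M′(Θ*), so a rate near 1 corresponds to the slow, first-order behavior discussed above.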